-*-mode:org-*-

M2-Planet aims to be the minimal C compiler needed for bootstrapping: one that
supports structs, arrays, inline assembly and self hosting. As a result it is
rather small, under 1.7Kloc according to sloccount.

* SETUP
The most obvious way to set up for M2-Planet development is to clone and set up mescc-tools first (https://github.com/oriansj/mescc-tools.git)
Then be sure to install a C compiler and a make implementation of your choice.

* BUILD
The standard C based approach to building M2-Planet is simply running:
make M2-Planet

Should you wish to verify that M2-Planet was built correctly run:
make test

* ROADMAP
M2-Planet V1.0 is the bedrock of all future M2-Planet versions. Any future
release that depends upon a more advanced version to be compiled will
require the version prior to it to be named V2.0, and the same property applies
to all future releases of M2-Planet. All minor releases are buildable by the last
major release and all major releases are buildable by the major release before them.

* DEBUG
To get a properly debuggable binary, run: make M2-Planet-gcc
However, if you are comfortable with gdb and know that function names are
prefixed with FUNCTION_, the M2-Planet binary is quite debuggable.

* Bugs
M2-Planet assumes a very heavily restricted subset of the C language and many C
programs will break hard when passed to M2-Planet.

M2-Planet does not actually implement any primitive functionality; it is assumed
that it will be written in inline assembly by the programmer or provided by the
programmer during the assembly and linking stages.

* Magic
** argument and local stack
In M2-Planet the stack frame starts with the EDI pointer. It is preserved
because, should an argument be a function call which returns a value, EDI may
be overwritten and cause issues. Next comes the previous frame's base pointer
(EBP), as it will need to be restored upon return from the function call. Then
come the arguments, which are pushed onto the stack from left to right,
followed by the RETURN Pointer generated by the function call. After that the
locals are placed upon the stack first to last, followed by any temporary values:
       +----------------------+
EDI -> | Previous EDI pointer |
       +----------------------+
EBP -> | Previous EBP pointer |
       +----------------------+
1st -> | Argument 1           |
       +----------------------+
2nd -> | Argument 2           |
       +----------------------+
... -> ........................
       +----------------------+
Nth -> | Argument N           |
       +----------------------+
RET -> | RETURN Pointer       |
       +----------------------+
1st -> | Local 1              |
       +----------------------+
2nd -> | Local 2              |
       +----------------------+
... -> ........................
       +----------------------+
Nth -> | Local N              |
       +----------------------+
temps-> .......................
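
The push order in the graph above can be written down as a small Python
sketch. It is only an illustration: frame_layout and the slot labels are
hypothetical names, not anything found in M2-Planet, and no byte offsets
are implied.

```python
def frame_layout(args, local_vars):
    """Return the slots of one call frame in the order they are pushed."""
    slots = ["previous EDI", "previous EBP"]
    slots += ["arg " + name for name in args]            # left to right
    slots.append("RETURN pointer")                       # generated by the call
    slots += ["local " + name for name in local_vars]    # first to last
    return slots                                         # temporaries follow


layout = frame_layout(["a", "b"], ["x"])
for depth, slot in enumerate(layout):
    print(depth, slot)
```

Note that the RETURN pointer sits between the arguments and the locals.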

** AArch64 port notes
Some details about design, implementation and generated code; maybe of
interest for new targets, to M1 users, compiler hackers and curious
minds in general.

*** Target ISA related issues

In the ARMv8 AArch64 A64 instruction set that we target, immediate
values in instructions are not aligned to 4 bits, which is the size
of the convenient single hexadecimal digit (that served well so far,
for other ports). Other groups of bits are affected. For example,
those that encode registers are usually 5 bits long, and horror stories
about non-contiguous chunks (due to endianness interactions with M1, a
big bit endian language) are told, so not even octal or binary
encodings solve our problem.
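
To make the problem concrete, here is a small Python sketch (not part of
M2-Planet; ldr_x0_from and m1_hex are hypothetical helper names) that
encodes "ldr x0, [xN]" for a few base registers and prints the bytes in
memory order, the way they appear in an M1 definition. The base register
field lives in bits 5-9, so it straddles two hex digits and, once the
word is stored little-endian, two different bytes.

```python
def ldr_x0_from(rn):
    """Encode `ldr x0, [xRN]` (64-bit load, zero offset) as a 32-bit word."""
    assert 0 <= rn < 32
    return 0xF9400000 | (rn << 5) | 0   # Rt = x0 in bits 0-4, Rn in bits 5-9


def m1_hex(word):
    """Bytes in memory order (little-endian), written as hex digits."""
    return word.to_bytes(4, "little").hex()


for rn in (1, 16, 17, 18):
    print(rn, m1_hex(ldr_x0_from(rn)))
# 1 200040f9
# 16 000240f9
# 17 200240f9
# 18 400240f9
```

Changing the register number changes digits in two separate places, so
a definition cannot be split into a reusable prefix plus a register
digit. The last line matches the LDR_X0_[SP] definition shown in the
Stack pointer section.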

Because of that, we have less flexible and reusable definitions than
usual (see aarch64_defs.M1). Also, we resort to unconventional (for
M2-Planet standards) workarounds and generate worse code. Anyway,
neither size nor speed are high priorities and there's room for
improvement.

On the bright side, affected codepaths/definitions and working tactics
are better known now, this being the first target of M2-Planet with
such features. That might be helpful in future ports (RISC-V comes to
mind, which has a weird structure too... designed "so that as many bits
as possible are in the same position in every instruction" but not for
basic tools).

Some notable workarounds are:

- Create one independent definition per _needed_ operation, instead of
  reusing common parts like we do for other archs. The resulting set is
  quite small even following this simple rule consistently. See how
  the SKIP_INST_* family seems nicely aligned for more fine-grained
  hex but we don't exploit that; or the PUSH/POP ones that also kind
  of do, but watch out for the general case if you plan to create your
  own set of general purpose definitions.

One interesting example reflects that creating new definitions is
avoided unless readability suffers: the pair LOAD_W2_AHEAD,
LSHIFT_X0_X0_X2 exists because our two main registers are in use in
postfix_expr_array() and the common shift is inconvenient in this
particular case. It's possible to reuse definitions (preliminary
patches did this) using multiplication and addition (quite natural by
the way, even if suboptimal); or dancing with the stack to fit
everything into place (harder to reason about). It felt too alien in
the codebase so a couple of definitions were added.

- Use the register-based instructions instead of those using
  immediates. This forces us to generate more code in order to put the
  data in the register. Data is mixed with the code (not even in a
  fancy pool) to be loaded from and then skipped at run-time. See some
  of the multiple instances of the LOAD_W0_AHEAD then SKIP_32_DATA
  pattern.

- For control flow structures, the problem about immediates bits us
  again (hits, bites, bytes; sorry, can't resist) for conditional
  PC-relative branching. The jump is arbitrary, because any amount of
  code can be present in any given block to be skipped. AArch64
  PC-relative conditional branch instructions [that I found, newbie on
  board!] are based on immediate values, and we have to avoid
  arbitrary immediate values as usual.
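
The LOAD_W0_AHEAD then SKIP_32_DATA pattern from the list above can be
sketched in Python as follows. This is a rough illustration, not the
code in M2-Planet: load_constant_w0 is a hypothetical helper, and the
"%" prefix for a 32-bit data word is assumed M1 notation.

```python
def load_constant_w0(value):
    """Emit M1 lines that put a 32-bit constant in W0 without an immediate."""
    return [
        "LOAD_W0_AHEAD",    # load the data word placed after the next instruction
        "SKIP_32_DATA",     # branch over that word so it is never executed
        "%" + str(value),   # assumption: M1 notation for a 32-bit data word
    ]


print("\n".join(load_constant_w0(42)))
```

Three words are spent where other ports use one instruction with an
embedded immediate, which is the "more code" cost mentioned above.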

There's an *unconditional* absolute branch instruction that accepts
the target addr from a register (which we can set at will using the
"load_ahead+skip" pattern). So, we construct an unconditional
over-the-block jump and skip this jump with the conditional one
("inverted", more about this in a moment). The point is that now we
know exactly the distance to jump: it's the size of that
construction. We can define a couple of conditional branch
instructions because the immediate is not arbitrary anymore, nice!

Maybe this pseudo-code explains it better:

  if(cond) block_foo; else block_bar;
  more;

... is compiled to:

  if cond then skip past the unconditional-branch // To get to foo-code.
  // We know the space used by this code...
  set register to addr of else-label
  // ... and this one, that completes the jump to the alternative block.
  unconditional-branch to addr in register

  foo-code
  [Here we jump to the endif-label, omitted for clarity.]
else-label:
  bar-code
endif-label:
  more-code
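
The same scheme as a Python sketch that emits M1-style lines. It is an
illustration under assumptions, not the actual process_if():
compile_if, jump_to and the label names are hypothetical, LOAD_W16_AHEAD
and BR_X16 are assumed definition names (only CBNZ_X0_PAST_BR and
SKIP_32_DATA appear in this text), and ":" / "&" are assumed to be the
M1 notations for defining a label and for its 32-bit address.

```python
def jump_to(label):
    """Unconditional jump through a register, using load-ahead plus skip."""
    return [
        "LOAD_W16_AHEAD",   # assumed name: load the address word below
        "SKIP_32_DATA",     # hop over the address word
        "&" + label,        # assumption: 32-bit address of the label
        "BR_X16",           # assumed name: branch to the address in X16
    ]


def compile_if(uid, cond_code, foo_code, bar_code):
    """Lay out `if(cond) foo; else bar;` as in the pseudo-code above."""
    out = list(cond_code)                   # leaves the condition in X0
    out.append("CBNZ_X0_PAST_BR")           # true: skip the jump to else-label
    out += jump_to("else_label_" + uid)     # false: go to the alternative block
    out += foo_code
    out += jump_to("endif_label_" + uid)    # the jump omitted in the pseudo-code
    out.append(":else_label_" + uid)
    out += bar_code
    out.append(":endif_label_" + uid)
    return out


print("\n".join(compile_if("0", ["cond-code"], ["foo-code"], ["bar-code"])))
```

Only the conditional branch carries an immediate, and that immediate is
always the fixed size of the jump_to() sequence.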

A similar approach is used for other control flow structures. See
CBZ_X0_PAST_BR (cbz x0, #20) and CBNZ_X0_PAST_BR (cbnz x0, #20) used
as part of the generation of 'if', 'for', 'do' and 'while'
statements. Notice how the test is inverted: when Knight does JUMP.Z
we do CBNZ (process_if); when JUMP.NZ we CBZ (process_do).
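
The fixed #20 can be checked with a bit of arithmetic. The Python sketch
below assumes the skipped construction is four 4-byte words (load-ahead,
skip, address word, register branch); that breakdown is an inference
that agrees with #20, not something stated in aarch64_defs.M1. cb_x0 and
m1_hex are hypothetical helper names.

```python
WORD = 4  # every A64 instruction, and the inline address word, is 4 bytes

# cbz/cbnz itself + load-ahead + skip + address word + register branch
PAST_BR = 5 * WORD


def cb_x0(nonzero, offset):
    """Encode `cbz x0, #offset` or `cbnz x0, #offset` (64-bit register form)."""
    assert offset % WORD == 0
    imm19 = (offset // WORD) & 0x7FFFF      # the offset is stored in words
    opcode = 0xB5000000 if nonzero else 0xB4000000
    return opcode | (imm19 << 5) | 0        # Rt = x0


def m1_hex(word):
    """Bytes in memory order (little-endian), written as hex digits."""
    return word.to_bytes(4, "little").hex()


print(PAST_BR)                          # 20
print(m1_hex(cb_x0(False, PAST_BR)))    # a00000b4
print(m1_hex(cb_x0(True, PAST_BR)))     # a00000b5
```

Because the offset never changes, two definitions are enough for all
conditional branches.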

CSEL was considered but required an additional register, more labels
and code. A bit too invasive a change to make to the codebase.

As you can imagine, the ISA colored the port development from the very
beginning. It's a lot of fun to come up with basic solutions under
those limitations. The port works as expected but there's room for
experimentation.

*** Function call

The Base Pointer and its relation to arguments in function calls and
locals during function execution is a bit different compared to other
supported architectures. This simplifies some calculations. See how
unsurprising the depths are in collect_arguments() and
collect_local().

Note how these calculations are related to the "push/pop size". See
`Wasted stack space`.

Let's follow a couple of M2-Planet functions generating code for
prologue, call and epilogue with the help of some artsy-less ascii-art
stack graphs for clarity. The expected stack is "full" (the stack
pointer register contains the address of the last pushed element) and
descending (grows towards zero).

Most of the work is done by function_call(). First, we save (the
generated code does it at runtime of the compiled program, but please
bear with me about the point of view) three registers on the stack. We
include a scratch one ("tmp" value in the graphs) that we're going to
use for two different purposes. On the one hand, to store the actual
stack pointer (which is going to be the reference address --Base
Pointer-- during the execution of the called function). On the other
hand, when the BP is already set (which can't be done right now
because we need the actual BP to evaluate the arguments in caller
context) we use the register to store the addr of the function to be
called. The other two registers are the Link Register (X30) and the Base
Pointer (X17, also known as IP1) itself, to allow for recursion. Both
are prefixed with "o" in the following graphs, as in "old".

This structure gives us a simple reference for both the args and the
locals, without extra elements between those two sets. We rely on the
semantics of BLR (more on this in a bit) which doesn't use the stack
to save the return address, but a register. For other archs this is
not possible (or not exploited, see how for ARM-7 the LR is saved in
the stack just around the call proper; this puts it between the args
and the locals) so it's a difference worth documenting.

                                 ---> Address 0
tmp | oLR | oBP |
                ^
                |
                --- SP
                |
                --- BP-to-be

Now we're ready to evaluate and push arguments. Note that M2-Planet
doesn't follow AAPCS64. The evaluation might involve function calls
itself and arbitrary use of the stack, but everything will be like
this after all.

tmp | oLR | oBP | arg1 | arg2 | ... | argN |
                ^                          ^
                |                          |
                --- BP-to-be               --- SP (omitted from now on)

At this point we set the BP from the scratch register and execute
branch-and-link (BLR) to the function, reusing the (now free) X16
register (also known as IP0). This instruction saves the address of the
next instruction in X30 (LR, which we saved earlier to allow for
recursion).

tmp | oLR | oBP | arg1 | arg2 | ... | argN |
                ^
                |
                --- BP

During the called function the locals are pushed on the stack as usual
in M2-Planet.

tmp | oLR | oBP | arg1 | arg2 | ... | argN | loc1 | loc2 | ... | locN |
                ^
                |
                --- BP

When the function is about to return, we remove the locals from the
stack and execute the return proper, jumping to the address in LR
thanks to RET. This is handled by return_result().

tmp | oLR | oBP | arg1 | arg2 | ... | argN |
                ^
                |
                --- BP

Back in function_call() we remove the args from the stack.

tmp | oLR | oBP |
                ^
                |
                --- BP

Finally, we restore the saved registers (so X16, LR and BP contain
tmp, oLR and oBP again) leaving everything as it was before this
journey. Well... one important thing changed: following M2-Planet
conventions the value returned from the function, if any, is in X0.
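
The whole journey can be replayed with a Python list standing in for the
stack. This is a toy model under assumptions, not generated code:
simulate_call is a hypothetical name, the stack grows to the right as in
the graphs, and BP is kept as a list index instead of an address.

```python
def simulate_call(regs, args, local_vars, result):
    """Replay one call; returns (stack snapshots, final registers)."""
    regs = dict(regs)
    stack, shots = [], []

    def snap(step):
        shots.append((step, list(stack)))

    # function_call(): save the scratch register, LR and BP.
    stack += [regs["X16"], regs["LR"], regs["BP"]]
    bp_to_be = len(stack)               # boundary right after the saved registers
    snap("saved")

    stack += args                       # evaluated in the caller's context
    snap("args")

    regs["BP"] = bp_to_be               # set BP from the scratch register
    regs["LR"] = "return address"       # BLR writes the return address to LR
    stack += local_vars                 # pushed by the called function
    snap("locals")

    del stack[bp_to_be + len(args):]    # return_result(): drop locals, then RET
    regs["X0"] = result
    del stack[bp_to_be:]                # function_call(): drop the args
    regs["BP"] = stack.pop()            # restore in reverse order
    regs["LR"] = stack.pop()
    regs["X16"] = stack.pop()
    snap("restored")
    return shots, regs


before = {"X16": "tmp", "LR": "oLR", "BP": "oBP", "X0": None}
shots, after = simulate_call(before, ["arg1", "arg2"], ["loc1"], 7)
for step, contents in shots:
    print(step, contents)
```

In this model argument i sits at index BP + i and local j at index
BP + len(args) + j, with nothing in between.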

*** Stack pointer

Due to the alignment restriction (128 bits) for "push" and "pop" based on
the architectural register, we initialize and use X18 as the stack pointer
instead.

The M1 definitions referring to SP use X18; stack operations too.

For example:

DEFINE LDR_X0_[SP] 400240f9 is ldr x0, [x18]
DEFINE PUSH_LR 5e8e1ff8 is str x30, [x18, #-8]!
DEFINE INIT_SP f2030091 is mov x18, sp
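
A small Python sketch of why the architectural register does not fit.
It assumes 8-byte pushes, as in the PUSH_LR definition above, and a
16-byte (128 bits) alignment requirement; push and the starting address
are made up for the example.

```python
SLOT = 8     # bytes moved by one push (str x30, [x18, #-8]!)
ALIGN = 16   # 128 bits, required when the architectural SP is the base


def push(sp):
    """Full descending stack: move the pointer down one slot."""
    return sp - SLOT


sp = 0x8000                     # hypothetical, 16-byte aligned start
print(push(sp) % ALIGN)         # 8: misaligned after a single push
print(push(push(sp)) % ALIGN)   # 0: aligned again only every second push
```

X18 is a general purpose register, so it can hold the misaligned
intermediate values without trouble.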
